Add check_nm_dataset() for automated NONMEM dataset validation#7
Add check_nm_dataset() for automated NONMEM dataset validation#7roninsightrx merged 5 commits intomainfrom
Conversation
Adds a comprehensive validation function that runs 35+ checks on NONMEM datasets and returns a summary table (check/description/result) without throwing errors. Covers column naming, EVID/AMT/DV/MDV consistency, TIME ordering, ADDL/II overlap detection, reserved names, covariate missingness, and more. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces a new exported validation utility, check_nm_dataset(), to run a broad suite of automated consistency/sanity checks on NONMEM-style datasets and return a tidy PASS/FAIL/NA summary table without throwing errors.
Changes:
- Added
check_nm_dataset()implementation with ~35+ checks and optional verbose logging. - Added a dedicated
testthatfile with 31 unit tests covering the new check categories. - Exported the function, generated Rd documentation, and bumped the package version.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
R/check_nm_dataset.R |
New dataset validation function implementing the check suite and verbose reporting. |
tests/testthat/test-check_nm_dataset.R |
New unit tests validating expected PASS/FAIL behavior for each check. |
NAMESPACE |
Exports check_nm_dataset. |
man/check_nm_dataset.Rd |
Generated documentation for the new function. |
DESCRIPTION |
Version bumped to 0.0.0.9005. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
R/check_nm_dataset.R
Outdated
| has_col <- function(col) col %in% names(data) | ||
| cols <- toupper(names(data)) | ||
|
|
There was a problem hiding this comment.
Column handling is inconsistent: some checks use cols <- toupper(names(data)) (case-insensitive), but has_col() and most downstream accessors use case-sensitive names(data) / data$TIME / data$EVID. This can cause required columns to be reported as present while subsequent checks are silently skipped (or covariates misclassified) when inputs use lowercase/mixed-case column names. Consider normalizing names once (e.g., map to uppercase internally) or implementing case-insensitive lookup helpers used consistently throughout the function.
| max_len <- max(nchar(ids)) | ||
| log_check( | ||
| "id_digits", | ||
| paste0("ID variable has <= ", max_id_digits, " digits (longer may need $TABLE FORMAT)"), | ||
| if (max_len <= max_id_digits) "PASS" else "FAIL" | ||
| ) |
There was a problem hiding this comment.
max_len <- max(nchar(ids)) will return NA if any ID is missing, and the subsequent if (max_len <= max_id_digits) will error with “missing value where TRUE/FALSE needed”. Since the function promises to not throw errors, this should use na.rm = TRUE (and handle the all-NA case explicitly) or validate/flag missing IDs before computing max_len.
| max_len <- max(nchar(ids)) | |
| log_check( | |
| "id_digits", | |
| paste0("ID variable has <= ", max_id_digits, " digits (longer may need $TABLE FORMAT)"), | |
| if (max_len <= max_id_digits) "PASS" else "FAIL" | |
| ) | |
| non_missing_ids <- ids[!is.na(ids)] | |
| if (length(non_missing_ids) == 0) { | |
| log_check( | |
| "id_digits", | |
| paste0("ID variable has <= ", max_id_digits, " digits (longer may need $TABLE FORMAT)"), | |
| "NA" | |
| ) | |
| } else { | |
| max_len <- max(nchar(non_missing_ids)) | |
| log_check( | |
| "id_digits", | |
| paste0("ID variable has <= ", max_id_digits, " digits (longer may need $TABLE FORMAT)"), | |
| if (max_len <= max_id_digits) "PASS" else "FAIL" | |
| ) | |
| } |
| if (has_col("ADDL") && has_col("II") && has_col("EVID") && has_col("ID") && | ||
| has_col("TIME") && is.numeric(data$ADDL) && is.numeric(data$II)) { | ||
| overlap_found <- FALSE | ||
| for (id in unique(data$ID)) { | ||
| sub <- data[data$ID == id, ] | ||
| dose_idx <- which(sub$EVID %in% c(1, 4)) | ||
| if (length(dose_idx) < 2) next | ||
| for (i in seq_len(length(dose_idx) - 1)) { | ||
| row_i <- dose_idx[i] | ||
| row_next <- dose_idx[i + 1] | ||
| addl_i <- sub$ADDL[row_i] | ||
| ii_i <- sub$II[row_i] | ||
| if (!is.na(addl_i) && !is.na(ii_i) && addl_i > 0 && ii_i > 0) { | ||
| last_expanded <- sub$TIME[row_i] + addl_i * ii_i | ||
| if (last_expanded >= sub$TIME[row_next]) { | ||
| overlap_found <- TRUE |
There was a problem hiding this comment.
addl_overlap detection iterates dosing records in their current row order (dose_idx), but overlap is a time-based concept and the dataset may be unsorted (you separately check data_ordered). As written, an unsorted dataset can yield false positives/negatives for overlap. Consider sorting sub by TIME (and possibly by additional tie-breakers) before computing overlaps, or only running this check when the ordering check passes.
Allows mapping non-standard column names to expected NONMEM names via a dictionary argument (e.g., list(ID = "SUBJ", TIME = "TAD")). All checks now resolve column names through the dictionary. Keys are case-insensitive and unknown keys produce a warning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Reviewer feedback: the MDV/DV checks were internally inconsistent. mdv_dv_consistent used only is.na(DV), but dv_leq_zero_mdv expected DV <= 0 to have MDV=1, and mdv_nonmissing_obs used DV != 0 instead of DV > 0. A valid dataset with BLQ records could not pass all three. Changes: - Unified "missing/BLQ" = is.na(DV) | DV <= 0 across all MDV checks - mdv_dv_consistent now expects MDV=1 for NA or DV<=0, MDV=0 for DV>0 - Removed redundant dv_leq_zero_mdv check (covered by updated above) - Fixed mdv_nonmissing_obs filter to use DV > 0 (matching description) - Added test for BLQ (DV <= 0) MDV consistency Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The overlap check iterated dose records in row order, which could produce false positives/negatives on unsorted datasets. Now sorts dose_idx by TIME within each ID before comparing expanded dose windows against subsequent doses. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
check_nm_dataset()— a comprehensive validation function that runs 35+ automated checks on NONMEM datasetsTest plan
devtools::test(filter = "check_nm_dataset")— 31/31 passing🤖 Generated with Claude Code